Heatmap function

create_gene_heatmap <- function(de_results_df, dge_object, font_size = 12, num_genes = 20) {
  # Subset the genes
  top_genes <- de_results_df$Symbol[1:num_genes]
  subset_matrix <- dge_object$logCPM[top_genes, colnames(dge_object$counts)]
  
  # Create annotation dataframe
  sample_info <- dge_object$samples
  annotation_df <- data.frame(
    Timepoint = sample_info$Timepoint,
    Donor = sample_info$Donor,
    Subset = sample_info$Subset
  )
  rownames(annotation_df) <- colnames(subset_matrix)
  
  # Reorder gene expression matrix
  ordered_samples <- order(sample_info$Timepoint_day, sample_info$Subset, sample_info$Donor)
  subset_matrix <- subset_matrix[, ordered_samples]
  sample_info <- sample_info[ordered_samples, ]
  
  # Create the heatmap
  heatmap <- pheatmap(subset_matrix,
           scale = "row",
           cluster_rows = FALSE,
           cluster_cols = FALSE,
           color = viridis(50),
           show_colnames = FALSE,
           annotation_col = annotation_df,
           fontsize = font_size,
           silent = FALSE)
  }

TIRE-seq T cell human de splines timecourse

The aim of this notebook is to identify genes that change over time upon T cell stimulation.

Recap

I sequenced these samples with Prime-seq on extracted RNAs already. Results were not that great. Here reprocess the same samples with TIRE-seq for a head to head comparison.

Samples

  • Yasmin Nouri plates.
  • All wells have 10,000 cells in 50ul total volume.
  • Comprised of 25ul cells in media and 25ul 2x buffer TCL
  • The cell concentration is therefore 200 cells/uL.
  • So 20uL input will be 4000 cells.

Lab notes

The processing went as planned. Full writeup available at ELN link

Read data

This was generated in 2A_clustering

Workflow is from https://hackmd.io/@9RO7DTzsQ_2CaEAsT6y6NA/Bk0pW5gZn#Exercise-2---Cubic-Splines

This is a more relevant analysis strategy for timecourse experiments where we fit a polynominal curve to each timepoint.

sce <- readRDS(here::here(
     "data/TIRE_Tcell/SCEs/tcell_cluster.sce.rds"
))

dge <- scran::convertTo(sce, type="edgeR")
tb <- as_tibble(dge$samples)

Check on the perturbations conducted in this experiment:

tb %>% 
  dplyr::count(Subset, Donor, Timepoint) %>% 
  arrange(Timepoint) %>% 
  head(10)

Filter for genes that have at least 5 counts.
Currently keep 8,249

keep <- filterByExpr(dge, group=dge$samples$Timepoint, min.count=5)
dge <- dge[keep,]
summary(keep)
##    Mode   FALSE    TRUE 
## logical   12666    8249

Correct for composition biases by computing normalization factors with the trimmed mean of M-values method.

dge <- calcNormFactors(dge)

Check the MDS plot.
Day 0, 1 and 2 the times with the most difference.

limma::plotMDS(dge, labels=dge$samples$Timepoint)

Perform differential expression testing

Need to recode the timepoint to a numeric for the spline fitting to occur.

dge$samples$Timepoint_day <- as.numeric(str_split(
  string = dge$samples$Timepoint, pattern = "_", simplify = TRUE)[,2]
)

Construct the design matrix

Use a cubic regression spline curve with 3 degrees of freedom.
According to the edgeR manual this is a good place to start.

Checked a few different degrees of freedom and 4 fits the data well

splines <- ns(dge$samples$Timepoint_day, df = 4)

design <- model.matrix(~splines + Donor + Subset, dge$samples)
#design <- model.matrix(~splines + Donor + Subset, dge$samples)

head(design)
##            (Intercept)   splines1    splines2  splines3    splines4
## AAAGACTTCG           1 0.00000000  0.00000000 0.0000000  0.00000000
## AAAGTGTGTG           1 0.65304487 -0.01266043 0.2267403 -0.12754144
## AACGAGCCTG           1 0.71153846  0.15028422 0.1253577 -0.07051369
## AACGTCAGCC           1 0.00000000  0.00000000 0.0000000  0.00000000
## AAGATTTGCC           1 0.00976801  0.20863621 0.3397926  0.44180316
## ACAAAGAGAG           1 0.00000000 -0.16321244 0.3730570  0.79015544
##            DonorDonor_64 SubsetCD8
## AAAGACTTCG             1         0
## AAAGTGTGTG             1         1
## AACGAGCCTG             0         1
## AACGTCAGCC             0         1
## AAGATTTGCC             0         1
## ACAAAGAGAG             0         0

The four spline coefficients do not have any particular meaning. Hypothesis testing would only make sense if the three coefficients are assessed together. The advantage of using a cubic spline curve is that it provides more stable fit at the end points compared to a polynomial.

Fit BCV

Compared to the edgeR example on timecourse analysis the biological variation is pretty low.

par(mfrow=c(1,2))

dge <- estimateDisp(dge, design)
summary(dge$trended.dispersion)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04159 0.04381 0.04962 0.05177 0.05700 0.07432
plotBCV(dge)

fit <- glmQLFit(dge, design, robust=TRUE)
summary(fit$var.prior)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3460  0.7951  0.8640  0.8726  0.9391  1.4263
plotQLDisp(fit)

Timecourse analysis

In a timecourse experiment, we are looking for genes that change expression level over time. Here, the design matrix uses 3 natural spline basis vectors to model smooth changes over time, without assuming any particular pattern to the trend.
We test for a trend by conducting F-tests on 3 df for each gene:

fit <- glmQLFTest(fit, coef=2:5)

The topTags function lists the top set of genes that vary the most over time.

results <- as_tibble(as.data.frame(
  topTags(fit, n=length(fit$genes$Symbol))
  ))
results[results$FDR < 0.1,]

The total number of genes with significant (5% FDR) changes at different doses can be examined with decideTests.

There are 8,675 differentially expressed genes across timepoints

summary(decideTests(fit))
##        splines4-splines3-splines2-splines1
## NotSig                                1579
## Sig                                   6670

Note that all three spline coefficients should be tested together in this way. It is not meaningful to replace the F-tests with t-tests for the individual coefficients, and similarly the logFC columns of the top table do not have any interpretable meaning.
The trends should instead be interpreted by way of trend plots, as we show now.
Finally, we visualize the fitted spline curves for the top four genes. We start by computing the observed and fitted log-CPM values for each gene:

top_genes <- head(results$Symbol,9)

logCPM.obs <- as.data.frame(cpm(dge, log=TRUE, prior.count=fit$prior.count))
logCPM.obs$ID <- row.names(logCPM.obs)
logCPM.fit <- as.data.frame(cpm(fit, log=TRUE))
logCPM.fit$ID <- row.names(logCPM.fit)

# Mung the cpm columns into a single tibble
lcm.obs <- pivot_longer(logCPM.obs[top_genes,], cols = -ID, names_to = "sample", values_to = "count") %>% 
  dplyr::rename(obs_cpm = count)
lcm.fit <- pivot_longer(logCPM.fit[top_genes,], cols = -ID, names_to = "sample", values_to = "count") %>% 
  dplyr::rename(fit_cpm = count)

lcm.obs$fit_cpm <- lcm.fit$fit_cpm

Add back the sample metadata

lcm <- left_join(lcm.obs, dge$samples[,c("Well_BC", "Timepoint_day", "Subset", "Donor")],
                 by=c("sample" = "Well_BC"))

Visulaise results

Set up the gene expression matrix to view a heatmap of early timepoints

dge$logCPM <- cpm(dge, log=TRUE, prior.count=1)
dge_early <- dge[,dge$samples$Timepoint_day <= 5]

Most significant changes over time

Generate the plot linear x axis. Both genes go up and down in this plot

plt3 <- ggplot(lcm) +
  geom_point(aes(y = obs_cpm, x=Timepoint_day, colour=Subset)) +
  scale_colour_Publication() +
  geom_smooth(aes(y = fit_cpm, x=Timepoint_day), method = "loess", se = FALSE, color = "red") +
  xlab("Timepoint_day") + ylab("logCPM") +
  theme_Publication() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  facet_wrap(~ ID, scales = "free_y", labeller = labeller(ID = label_value)) +
  theme(strip.text = element_text(face = "bold.italic"))

plt3

Prepare a heatmap of the the genes that change during the timecourse

create_gene_heatmap(results, dge_early, font_size = 12, num_genes = 20)

Subset for genes that go up upon stimulation

Focus on genes that go up upon stimulation with CD28 and CD3 beads.

up <- results %>% 
  filter(FDR < 0.05) %>% 
  filter(logFC.splines2 > 0)

up_genes <- head(up$Symbol,4)

logCPM.obs <- as.data.frame(cpm(dge, log=TRUE, prior.count=fit$prior.count))
logCPM.obs$ID <- row.names(logCPM.obs)
logCPM.fit <- as.data.frame(cpm(fit, log=TRUE))
logCPM.fit$ID <- row.names(logCPM.fit)

# Mung the cpm columns into a single tibble
lcm.obs <- pivot_longer(logCPM.obs[up_genes,], cols = -ID, names_to = "sample", values_to = "count") %>% 
  dplyr::rename(obs_cpm = count)
lcm.fit <- pivot_longer(logCPM.fit[up_genes,], cols = -ID, names_to = "sample", values_to = "count") %>% 
  dplyr::rename(fit_cpm = count)

lcm.obs$fit_cpm <- lcm.fit$fit_cpm

lcm <- left_join(lcm.obs, dge$samples[,c("Well_BC", "Timepoint_day", "Subset", "Receptor", "Donor")],
                 by=c("sample" = "Well_BC"))

Visualise the results

plt4 <- ggplot(lcm) +
  geom_point(aes(y = obs_cpm, x=Timepoint_day, colour=Subset)) +
  scale_colour_Publication() +
  geom_smooth(aes(y = fit_cpm, x=Timepoint_day), method = "loess", se = FALSE, color = "red") +
  xlab("Timepoint_day") + ylab("logCPM") +
  theme_Publication() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  facet_wrap(~ ID, scales = "free_y", labeller = labeller(ID = label_value)) +
  theme(strip.text = element_text(face = "bold.italic"))

plt4

Prepare a heatmap of the the genes that go up.

create_gene_heatmap(up, dge_early, font_size = 12, num_genes = 20)

Subset for genes that go down upon stimulation

This is a little less clear because the cells are thawed from liquid nitrogen so start off in a rough state.

down <- results %>% 
  filter(FDR < 0.05) %>% 
  filter(logFC.splines2 < 0)

down_genes <- head(down$Symbol,4)

logCPM.obs <- as.data.frame(cpm(dge, log=TRUE, prior.count=fit$prior.count))
logCPM.obs$ID <- row.names(logCPM.obs)
logCPM.fit <- as.data.frame(cpm(fit, log=TRUE))
logCPM.fit$ID <- row.names(logCPM.fit)

# Mung the cpm columns into a single tibble
lcm.obs <- pivot_longer(logCPM.obs[down_genes,], cols = -ID, names_to = "sample", values_to = "count") %>% 
  dplyr::rename(obs_cpm = count)
lcm.fit <- pivot_longer(logCPM.fit[down_genes,], cols = -ID, names_to = "sample", values_to = "count") %>% 
  dplyr::rename(fit_cpm = count)

lcm.obs$fit_cpm <- lcm.fit$fit_cpm

lcm <- left_join(lcm.obs, dge$samples[,c("Well_BC", "Timepoint_day", "Subset", "Receptor", "Donor")],
                 by=c("sample" = "Well_BC"))
plt5 <- ggplot(lcm) +
  geom_point(aes(y = obs_cpm, x=Timepoint_day, colour=Subset)) +
  scale_colour_Publication() +
  geom_smooth(aes(y = fit_cpm, x=Timepoint_day), method = "loess", se = FALSE, color = "red") +
  xlab("Timepoint_day") + ylab("logCPM") +
  theme_Publication() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  facet_wrap(~ ID, scales = "free_y", labeller = labeller(ID = label_value)) +
  theme(strip.text = element_text(face = "bold.italic"))

plt5

Prepare a heatmap of the the genes that change during the timecourse.
This is really not clear in the heatmap visulaisation.

create_gene_heatmap(down, dge_early, font_size = 12, num_genes = 20)

Differential expression analysis day 1 vs day 2

This is where the majority of the changes are taking place based on the spline analysis

Conclusion

  • Some of the genes make sense. Many of the typical immune candidates are not expressed highly enough to come up.
  • Most of the time sensitive genes decrease over time. This is probably related to the first timepoint being cells thawed from cyropreservation.

Session info

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 9.4 (Plow)
## 
## Matrix products: default
## BLAS:   /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRblas.so 
## LAPACK: /stornext/System/data/software/rhel/9/base/tools/R/4.4.1/lib64/R/lib/libRlapack.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
##  [1] grid      splines   stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.3.0                ggthemes_5.1.0             
##  [3] here_1.0.1                  knitr_1.48                 
##  [5] viridis_0.6.5               viridisLite_0.4.2          
##  [7] patchwork_1.3.0             pheatmap_1.0.12            
##  [9] edgeR_4.2.2                 limma_3.60.6               
## [11] scater_1.32.1               scuttle_1.14.0             
## [13] lubridate_1.9.3             forcats_1.0.0              
## [15] stringr_1.5.1               dplyr_1.1.4                
## [17] purrr_1.0.2                 readr_2.1.5                
## [19] tidyr_1.3.1                 tibble_3.2.1               
## [21] ggplot2_3.5.1               tidyverse_2.0.0            
## [23] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [25] Biobase_2.64.0              GenomicRanges_1.56.2       
## [27] GenomeInfoDb_1.40.1         IRanges_2.38.1             
## [29] S4Vectors_0.42.1            BiocGenerics_0.50.0        
## [31] MatrixGenerics_1.16.0       matrixStats_1.4.1          
## 
## loaded via a namespace (and not attached):
##  [1] gridExtra_2.3             rlang_1.1.4              
##  [3] magrittr_2.0.3            compiler_4.4.1           
##  [5] mgcv_1.9-1                DelayedMatrixStats_1.26.0
##  [7] vctrs_0.6.5               pkgconfig_2.0.3          
##  [9] crayon_1.5.3              fastmap_1.2.0            
## [11] XVector_0.44.0            labeling_0.4.3           
## [13] utf8_1.2.4                rmarkdown_2.28           
## [15] tzdb_0.4.0                UCSC.utils_1.0.0         
## [17] ggbeeswarm_0.7.2          bluster_1.14.0           
## [19] xfun_0.48                 zlibbioc_1.50.0          
## [21] cachem_1.1.0              beachmat_2.20.0          
## [23] jsonlite_1.8.9            highr_0.11               
## [25] DelayedArray_0.30.1       BiocParallel_1.38.0      
## [27] cluster_2.1.6             irlba_2.3.5.1            
## [29] parallel_4.4.1            R6_2.5.1                 
## [31] bslib_0.8.0               stringi_1.8.4            
## [33] RColorBrewer_1.1-3        jquerylib_0.1.4          
## [35] Rcpp_1.0.13               igraph_2.0.3             
## [37] Matrix_1.7-0              timechange_0.3.0         
## [39] tidyselect_1.2.1          rstudioapi_0.17.0        
## [41] abind_1.4-8               yaml_2.3.10              
## [43] codetools_0.2-20          lattice_0.22-6           
## [45] withr_3.0.1               evaluate_1.0.1           
## [47] pillar_1.9.0              generics_0.1.3           
## [49] rprojroot_2.0.4           hms_1.1.3                
## [51] sparseMatrixStats_1.16.0  munsell_0.5.1            
## [53] glue_1.8.0                metapod_1.12.0           
## [55] tools_4.4.1               BiocNeighbors_1.22.0     
## [57] ScaledMatrix_1.12.0       locfit_1.5-9.10          
## [59] scran_1.32.0              colorspace_2.1-1         
## [61] nlme_3.1-164              GenomeInfoDbData_1.2.12  
## [63] beeswarm_0.4.0            BiocSingular_1.20.0      
## [65] vipor_0.4.7               cli_3.6.3                
## [67] rsvd_1.0.5                fansi_1.0.6              
## [69] S4Arrays_1.4.1            gtable_0.3.5             
## [71] sass_0.4.9                digest_0.6.37            
## [73] dqrng_0.4.1               SparseArray_1.4.8        
## [75] ggrepel_0.9.6             farver_2.1.2             
## [77] htmltools_0.5.8.1         lifecycle_1.0.4          
## [79] httr_1.4.7                statmod_1.5.0